Index pairs of a string [Aho-Corasick Automata, Trie]

Time: O(N+M+Z); Space: O(T); easy

Given a text string and words (a list of strings), return all index pairs [i, j] so that the substring text[i]…text[j] is in the list of words.

Example 1:

Input: text = “thestoryofleetcodeandme”, words = [“story”,“fleet”,“leetcode”]

Output: [[3,7],[9,13],[10,17]]

Example 2:

Input: text = “ababa”, words = [“aba”,“ab”]

Output: [[0,1],[0,2],[2,3],[2,4]]

Explanation:

  • Notice that matches can overlap, see “aba” is found in [0,2] and [2,4].

Notes:

  • All strings contain only lowercase English letters.

  • It’s guaranteed that all strings in words are different.

  • 1 <= len(text) <= 100

  • 1 <= len(words) <= 20

  • 1 <= len(words[i]) <= 50

  • Return the pairs [i,j] in sorted order (i.e. sort them by their first coordinate in case of ties sort them by their second coordinate).

1. Naive Approach

A naive approach would be finding all the match of the word in the text and track the position of the hit.

After finding all the match for every word, do some sorting like interval sorting. I found it tricky.

2. Better Approach

We can use the beauty of TreeMap which does natural sorting for us. We will use the starting position as a Key(K) and value will list(List) of endings for all the word. Before creating the output pairs, we can take Key as the first value and Sort the list make pairs with the first value.

Example:

text = “ababa”, words = [“aba”,“ab”]

  1. Take the word “aba” and search it in the text. It will hit the 0th position and 2nd position. so Map would be

    • K V

    • 0 ->2

    • 2 ->4

  2. Now take “ab”, it will hit 0th and 2nd position. Update our map and it will look like below:

    • K V

    • 0 ->2,1

    • 2 ->4,3

  3. As we see that our map is naturally ordered already. So take the keys and sort the value before making the pairs.

    • K V

    • 0 ->1,2

    • 2 ->3,4

  4. So the final result would be:

    • [0,1][0,2][2,3][2,4]

3. Aho-Corasick Automata¶

[9]:
import collections

class AhoNode(object):
    def __init__(self):
        self.children = collections.defaultdict(AhoNode)
        self.indices = []
        self.suffix = None
        self.output = None
[10]:
class AhoTrie(object):

    def step(self, letter):
        while self.__node and letter not in self.__node.children:
            self.__node = self.__node.suffix
        self.__node = self.__node.children[letter] if self.__node else self.__root
        return self.__get_ac_node_outputs(self.__node)

    def __init__(self, patterns):
        self.__root = self.__create_ac_trie(patterns)
        self.__node = self.__create_ac_suffix_and_output_links(self.__root)

    def __create_ac_trie(self, patterns):                # Time:  O(n), Space: O(t)
        root = AhoNode()
        for i, pattern in enumerate(patterns):
            node = root
            for c in pattern:
                node = node.children[c]
            node.indices.append(i)
        return root

    def __create_ac_suffix_and_output_links(self, root):  # Time:  O(n), Space: O(t)
        queue = collections.deque()
        for node in root.children.values():
            queue.append(node)
            node.suffix = root

        while queue:
            node = queue.popleft()
            for c, child in node.children.items():
                queue.append(child)
                suffix = node.suffix
                while suffix and c not in suffix.children:
                    suffix = suffix.suffix
                child.suffix = suffix.children[c] if suffix else root
                child.output = child.suffix if child.suffix.indices else child.suffix.output

        return root

    def __get_ac_node_outputs(self, node):                  # Time:  O(z)
        result = []
        for i in node.indices:
            result.append(i)
        output = node.output
        while output:
            for i in output.indices:
                result.append(i)
            output = output.output
        return result
[11]:
class Solution1(object):
    """
    # Time: O(n + m + z), n is the total size of patterns
    #                   , m is the total size of query string
    #                   , z is the number of all matched strings
    # Space: O(t), t is the total size of ac automata trie
    """
    def indexPairs(self, text, words):
        """
        :type text: str
        :type words: List[str]
        :rtype: List[List[int]]
        """
        result = []
        reversed_words = [w[::-1] for w in words]
        trie = AhoTrie(reversed_words)

        for i in reversed(range(len(text))):
            for j in trie.step(text[i]):
                result.append([i, i+len(reversed_words[j])-1])

        result.reverse()
        return result
[12]:
s = Solution1()

text = "thestoryofleetcodeandme"
words = ["story","fleet","leetcode"]
assert s.indexPairs(text, words) == [[3,7],[9,13],[10,17]]

text = "ababa"
words = ["aba","ab"]
assert s.indexPairs(text, words) == [[0,1],[0,2],[2,3],[2,4]]